Cholinergic activity reflects reward expectations and predicts behavioral responses

Summary
Basal forebrain cholinergic neurons (BFCNs) play an important role in associative learning, suggesting that BFCNs may participate in processing stimuli that predict future outcomes. However, the impact of outcome probabilities on BFCN activity has remained elusive. Therefore, we performed bulk calcium imaging and recorded the spiking of identified cholinergic neurons in the basal forebrain of mice performing a probabilistic Pavlovian cued outcome task. BFCNs responded more strongly to sensory cues that were often paired with reward. Reward delivery also activated BFCNs, with surprising rewards eliciting a stronger response, whereas punishments evoked uniform positive-going responses. We propose that BFCNs differentially weigh predictions of positive and negative reinforcement, reflecting the divergent relative salience of forecasting appetitive and aversive outcomes, which is partially explained by a simple reinforcement learning model of a valence-weighed unsigned prediction error. Finally, the extent of cue-driven cholinergic activation predicted subsequent decision speed, suggesting that expectation-gated cholinergic firing is instructive for reward-seeking behaviors.


INTRODUCTION
Cholinergic neurons of the basal forebrain are important for associative learning. This idea is supported by the selective loss of cholinergic cells that parallels cognitive decline in patients with Alzheimer disease. 1,2 Although lesion and pharmacology studies have confirmed this role, 3-5 they cannot reveal how BFCNs exert their control over learning. To address the mechanisms by which BFCNs contribute to associative learning, it is important to investigate the behavioral correlates of BFCN activity at temporal resolutions comparable to the time scales of the behaviorally relevant events that animals and humans encounter. 6,7 This has only recently become possible, enabled by the development of optogenetic and imaging tools. 8-11 Selective cholinergic lesions of the basal forebrain were shown to impair learning in rodents 12-18 and monkeys, 19 and basal forebrain lesions resulting from aneurysm rupture of the anterior cerebral or anterior communicating artery lead to severe learning impairments in humans. 20 Previous studies have proposed that responses of cholinergic and/or noncholinergic basal forebrain neurons to behaviorally salient stimuli may underlie the involvement of the basal forebrain in learning. 9-11,21,22 Specifically, cholinergic activation may increase cortical acetylcholine release, inducing plastic changes in sensory responses. 23,24 A recent study connected these pieces of evidence by bulk imaging of BFCNs during auditory fear learning. 11 However, it is not yet known how BFCNs process sensory cues with different predictive features during learning, which could serve as a basis for differential behavioral responses to sensory events that forecast distinct outcomes. Consequently, a comprehensive model of the cholinergic responses that subserve associative learning is also lacking.
We set out to fill this knowledge gap by recording cholinergic activity in a probabilistic Pavlovian cued outcome task, which allowed us to directly control outcome probabilities and cue-outcome contingencies during learning. 25 Of note, reward expectation can also be manipulated by reward size. 26,27 However, since we hypothesized that BFCNs are sensitive to outcome probabilities, we chose to manipulate reward probability instead, even though it is harder to learn: animals have to integrate over multiple trials to infer differences in probability, whereas differences in reward size can be learned from a single trial. 28 We imaged the bulk calcium responses of BFCNs using fiber photometry 29 and recorded the activity of identified basal forebrain cholinergic neurons while mice were performing a head-fixed auditory probabilistic Pavlovian cued outcome task. 25 BFCNs were activated by outcome-predicting stimuli, as well as

Figure 1. Mice were trained on a probabilistic Pavlovian conditioning task
(A) Schematic illustration of the behavioral training and block diagram of the task. A variable foreperiod, in which the mouse was not allowed to lick, was followed by the presentation of one of two pure tones of well-separated pitch, which predicted reward, punishment, or nothing with different contingencies ("likely reward" and "unlikely reward" cues).
(B) Raster plot of lick responses to the cues predicting likely reward (top) and unlikely reward (bottom) from an example session. Yellow shading, response window (RW); gray shading, reinforcement delivery (RD).

Figure 3D). Since these neurons exhibited similar responses to conditioned and unconditioned stimuli, they were treated as a single data set for this study; nevertheless, restricting data analyses to the HDB cholinergic neurons yielded similar results.
Large cholinergic responses to reward-predicting cues, surprising rewards, and air puff punishments
We first asked whether individual BFCNs show spiking responses to auditory cue stimuli that predict outcomes with different probabilities. To address this, we aligned BFCN spikes to cue onset and examined raster plots and peri-event time histograms (PETHs) of individual BFCNs (see Figure S4 for a schematic representation of the analysis). We found that BFCNs responded to both auditory cues, with a median peak latency of 133.5 ms for the "likely reward" cue and 422 ms for the "unlikely reward" cue (Figures 4A, 4B, and S5A; interquartile range, 44.5-231 ms and 273-573.5 ms for the two cue types). To cover both peaks, we chose a 500-ms response window (C500), in which we compared BFCN responses to conditioned cue stimuli based on whether they signaled a high or low probability of future reward. BFCNs showed 151% stronger responses to the "likely reward" cues based on a comparison of PETH peak responses in the C500 window (p = 0.0008, Wilcoxon signed-rank test; Figure 4C; including n = 14 neurons recorded in sessions in which mice encountered >10 surprising reward trials; see Figure S6 for all n = 25 neurons), which we also confirmed by spike-number-based statistics (p = 0.00061, Wilcoxon signed-rank test on BFCN firing rates in the C500 window; Figure 4C). Thus, BFCNs responded more strongly to sensory stimuli that signaled a high probability of reward.
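The PETH-based comparison above can be sketched in a few lines. The following is a minimal, stdlib-only Python illustration, not the authors' MATLAB code: it bins spikes around each event, converts the summed counts to a trial-averaged rate, and reads out the peak inside a response window such as C500 (0-500 ms after cue onset). The ±1 s window and 10-ms bins are assumptions, and the function names are hypothetical. The paired comparison of per-neuron peaks between cue types would then use a Wilcoxon signed-rank test (for example, `scipy.stats.wilcoxon`).

```python
from typing import List, Sequence, Tuple


def peth(spike_times: Sequence[float], event_times: Sequence[float],
         window: Tuple[float, float] = (-1.0, 1.0),
         bin_size: float = 0.01) -> List[float]:
    """Trial-averaged firing rate (Hz) in bins around each event; times in seconds."""
    n_bins = int(round((window[1] - window[0]) / bin_size))
    counts = [0] * n_bins
    for event in event_times:
        for spike in spike_times:
            rel = spike - event
            if window[0] <= rel < window[1]:
                # Guard against floating-point spill into a non-existent last bin
                idx = min(int((rel - window[0]) / bin_size), n_bins - 1)
                counts[idx] += 1
    # Summed counts -> mean rate per trial (spikes / trial / second)
    return [c / (len(event_times) * bin_size) for c in counts]


def window_peak(rates: Sequence[float],
                window: Tuple[float, float] = (-1.0, 1.0),
                bin_size: float = 0.01,
                response: Tuple[float, float] = (0.0, 0.5)) -> float:
    """Peak PETH value inside a response window, e.g. C500 = 0-500 ms after the event."""
    start = int(round((response[0] - window[0]) / bin_size))
    stop = int(round((response[1] - window[0]) / bin_size))
    return max(rates[start:stop])
```

For R200 or P200, the same `window_peak` call would be made on reinforcement-aligned PETHs with `response=(0.0, 0.2)`.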
Next, we tested whether individual BFCNs responded to the delivery of reward during Pavlovian conditioning, and whether this response depended on prior expectations about reward likelihood conveyed by the two auditory cues. To this end, we aligned the spike times of the same BFCNs to the time of reward delivery, again examining raster plots and PETHs (Figures 4D and 4E). We found that reward also elicited large BFCN responses, with median peak latencies of 86.5 and 82.7 ms for expected and surprising rewards, respectively (Figure S5B; interquartile range, 78.13-100.25 ms and 54.5-92.5 ms for expected and surprising rewards). To compare BFCN responses to expected vs. surprising rewards, we defined a 200-ms response window after reward delivery based on the above latency measurements (R200). We found that less expected rewards led to significantly stronger cholinergic firing (69.3%, p = 0.0245, Wilcoxon signed-rank test on R200 response peaks; Figure 4F), which was also confirmed by firing rate comparison (p = 0.02026, Wilcoxon signed-rank test on BFCN firing rates in the R200 window; Figure 4F). These findings showed that BFCN responses were modulated by reward expectation.
We took a similar approach to investigate BFCN responses to the delivery of air puff punishments. BFCNs also responded to punishment with a firing rate increase, with remarkably short peak latencies (Figures 4G, 4H, and S6C; median and interquartile range, 24.5 ms and 15.5-36 ms for surprising punishment and 24 ms and 15.5-32 ms for expected punishment), consistent with previous results. 8,10,34 When responses to surprising and expected punishments were directly compared in a 200-ms response window (P200), we did not find significant modulation by expectation (p = 0.7869, Wilcoxon signed-rank test on peak responses; Figure 4I; p = 0.8393, Wilcoxon signed-rank test on firing rates). We did not detect significant firing rate changes in either direction after omissions (Figure S7).
Cholinergic responses are explained by a reinforcement learning model of stimulus-driven, valence-weighed, unsigned prediction error
The differential BFCN responses to conditioned and unconditioned stimuli described above, which reflected outcome expectations, were suggestive of prediction error coding. 35 Based on the positive-going BFCN responses to both reward and punishment, we hypothesized that BFCNs might represent an unsigned prediction error. If an outcome prediction error weighted positive and negative outcomes equally, it would track the expectation of reinforcement irrespective of valence. It would thus predict identical responses to conditioned cue stimuli that foreshadow reinforcement with a fixed probability, sensitive only to the rate of reinforcement omissions. However, cholinergic neurons showed stronger responses after cues predicting likely reward than after those predicting unlikely reward but likely punishment. Therefore, our results suggest that BFCNs assign different weights to expected positive and negative outcomes, potentially related to differences in the absolute subjective values of the reinforcers. We did not observe BFCN responses to reinforcement omissions, suggesting that BFCN responses are driven by sensory stimuli; thus, a stimulus-driven, valence-weighed, unsigned prediction error model could explain BFCN spiking dynamics.
To test this, we implemented a simple three-parameter reinforcement learning (RL) model 35,36 and fitted it to cholinergic responses: where C represented the cholinergic response, S was a scaling parameter accounting for the different mean firing rates of BFCNs, R and P were the actual reward and punishment, and E(R) and E(P) were the expected reward and punishment determined by task contingencies. To take the assumed difference in relative sensitivity to water reward and air-puff punishment into account, we introduced two weight parameters, η1 and η2 (0 ≤ η1, η2 ≤ 1), which controlled how strongly BFCN responses were influenced by the expectation of positive and negative outcomes, respectively. Taking the absolute value of the sum of the reward and punishment prediction error terms ensured positive-going cholinergic responses irrespective of valence, resulting in a simple model of unsigned reward prediction error. We found that this model provided a good fit to BFCN firing rate changes in response to the different cues and reinforcers (C500, R200, and P200 response windows; Figures 5A-5C), and fitted significantly better than a control model in which the modeled expectations did not match the task contingencies (p = 0.0014 for all n = 25 BFCNs recorded; p = 0.0037 when only HDB cholinergic neurons were tested; Wilcoxon signed-rank test on the maximum likelihoods of the models; see STAR Methods).
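Because the model equation itself is not reproduced in this text, the sketch below uses one hypothetical form that is consistent with the verbal description: weights η1 and η2 act on the expected reward and punishment, and an absolute value yields positive-going responses. It is an illustration under stated assumptions, not the fitted model. At cue onset, the response is taken as the weighted expectation of either reinforcer; at outcome delivery, as the unsigned difference between the delivered reinforcer (coded as 1) and its weighted expectation. The cue contingencies are those given in the Behavioral training section, and the parameter values in the usage note are the reported medians.

```python
def unsigned_pe_response(event: str, p_reward: float, p_punish: float,
                         eta1: float, eta2: float, scale: float = 1.0) -> float:
    """Hypothetical valence-weighted unsigned prediction error (assumed form).

    event: "cue", "reward" or "punishment". p_reward and p_punish are the
    cue-specific task contingencies, used here as E(R) and E(P).
    NOTE: this functional form is an assumption, not the authors' equation.
    """
    if event == "cue":
        # Cue onset: weighted expectation of either reinforcer becomes available
        pe = eta1 * p_reward + eta2 * p_punish
    elif event == "reward":
        # R = 1 delivered against a weighted reward expectation
        pe = 1.0 - eta1 * p_reward
    elif event == "punishment":
        # P = 1 delivered against a weighted punishment expectation
        pe = 1.0 - eta2 * p_punish
    else:
        raise ValueError(f"unknown event: {event}")
    # Absolute value makes the modeled response positive-going for both valences
    return scale * abs(pe)
```

With η1 = η2, both cues give the same response, because both predict some reinforcement with 90% probability (80% + 10% and 25% + 65%); this is the valence-blind case described above. With η1 = 0.61 and η2 = 0.37, the "likely reward" cue yields the larger response, and an unexpected reward yields a larger response than an expected one, matching the qualitative pattern of the data.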
We next simulated spike trains of individual BFCNs based on the best-fit RL models. Baseline firing was modeled by a Poisson process with a frequency matched to the baseline firing rate of the modeled BFCN, and simulated firing responses were added according to Gaussian distributions with a fixed delay after cue and reinforcement events, where the number of added spikes was determined by the best-fit RL model for each BFCN. When we applied the same analyses to the simulated spike trains as to the real data, the simulated PETHs qualitatively reproduced BFCN responses to cues and rewards (Figure 5D). These results further support that the observed BFCN responses are consistent with the representation of a stimulus-induced, valence-weighed, unsigned prediction error.
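A minimal sketch of this simulation, assuming a homogeneous Poisson baseline and a fixed number of Gaussian-jittered evoked spikes per event. The 100-ms delay, 20-ms jitter and 5-s trial duration are placeholder values rather than parameters reported here, and the function name is hypothetical.

```python
import random
from typing import List, Optional, Sequence


def simulate_trial_spikes(baseline_hz: float, event_times: Sequence[float],
                          n_added: Sequence[int], delay: float = 0.1,
                          jitter: float = 0.02, duration: float = 5.0,
                          rng: Optional[random.Random] = None) -> List[float]:
    """Hypothetical simulator: Poisson baseline plus Gaussian-timed evoked spikes (s)."""
    if rng is None:
        rng = random.Random()
    spikes: List[float] = []
    # Homogeneous Poisson baseline: exponentially distributed inter-spike intervals
    if baseline_hz > 0:
        t = rng.expovariate(baseline_hz)
        while t < duration:
            spikes.append(t)
            t += rng.expovariate(baseline_hz)
    # Evoked spikes: n per event, centred at a fixed delay after that event
    for event, n in zip(event_times, n_added):
        spikes.extend(rng.gauss(event + delay, jitter) for _ in range(n))
    return sorted(spikes)
```

In this sketch, `n_added` would be supplied per event from the fitted model's predicted response; how that prediction is converted to an integer spike count is left open here.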
The best-fit η1 values were significantly larger than the best-fit η2 values, demonstrating stronger sensitivity of BFCN responses to reward than to punishment expectations (p = 0.0001, Wilcoxon signed-rank test; median ± SE of median: η1, 0.61 ± 0.04; η2, 0.37 ± 0.05). At the same time, the best-fit η2 values were significantly above 0.2, suggesting that mice also learned to predict negative outcomes, as reflected in their cholinergic responses according to the model (p = 0.0058, Wilcoxon signed-rank test). These parameters might reflect differences in the internal valuation of water reward and air puff punishment across animals and recording days, as well as differences in the sensitivity of individual BFCNs to reward expectation. We hypothesized that they reflect behavioral variability rather than heterogeneity across neurons, which would imply that these parameters are consistent within recording sessions and within individual mice. Indeed, we found smaller within- than across-mouse differences in best-fit η1 parameters (p = 0.002, Mann-Whitney U test), and smaller within- than across-session differences in best-fit η2 parameters (p = 0.047, Mann-Whitney U test; n = 25; Figure S8). This suggests that the best-fit scaling parameters for outcome expectations reflect inter-individual and/or behavioral differences rather than differential sensitivity of individual BFCNs.
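The within- versus across-group comparison can be illustrated as follows: pairwise absolute differences of a fitted parameter are split according to whether the two neurons share a mouse (or a session), and the two sets are compared with a Mann-Whitney U statistic. This is a hypothetical stdlib sketch with illustrative names; in practice, `scipy.stats.mannwhitneyu` returns this U statistic (for the first sample) together with a p-value.

```python
from itertools import combinations
from typing import Hashable, List, Sequence, Tuple


def within_across_diffs(values: Sequence[float],
                        groups: Sequence[Hashable]) -> Tuple[List[float], List[float]]:
    """Absolute pairwise differences, split by same vs. different group label."""
    within: List[float] = []
    across: List[float] = []
    for (v1, g1), (v2, g2) in combinations(zip(values, groups), 2):
        (within if g1 == g2 else across).append(abs(v1 - v2))
    return within, across


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U for sample x: pairs with x > y count 1, ties count 0.5.

    A small U means x tends to be smaller than y (e.g. within < across).
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u
```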
The perceived reward and punishment prediction errors are controlled by η1 and η2 in our model; therefore, together they determine the size of the unsigned outcome prediction error represented by BFCNs. If this signal can drive approach behaviors, as previous studies suggested, 21,37-39 we would expect the animals' anticipatory licking behavior to correlate with these model parameters. Indeed, we found that η1, as well as the sum of the two parameters (η1 + η2), characterizing the cholinergic neurons' sensitivity to momentary outcome prediction, correlated well with behavioral cue differentiation as indexed by the anticipatory lick rate difference (p = 0.012, R = 0.52

Cholinergic responses predict reaction time
The correlation of model parameters quantifying the animals' sensitivity to outcome expectations with behavioral performance prompted us to further assess whether BFCN responses could predict animal behavior. BFCN responses to outcome-predicting cues consistently preceded the animals' first licks (Figures 6A and 6B). When we aligned cholinergic spikes to the last lick before the foreperiod, during which mice were not allowed to lick, cholinergic activity peaked before licking with a time course similar to that of cue-related licking activity (Figure S9). These findings ruled out that potential "lick-driven" cholinergic activity could confound the results, and instead indicated that cholinergic activity had the potential to influence the behavioral responses of mice performing the task. Indeed, we found that larger cholinergic cue responses were followed by faster reactions (p = 0.00073 and p = 0.05108 for "likely reward" and "unlikely reward" cues, respectively; Wilcoxon signed-rank test; Figure 6C). Accordingly, cholinergic cue responses were larger when mice licked after the cue (p = 0.048 and p = 0.023 for "likely reward" and "unlikely reward" cues, respectively; Wilcoxon signed-rank test; Figure 6D). Since lick responses can be taken as an indication that mice expect reward, these results are consistent with cholinergic coding of reward expectation. Next, we divided the trials into quartiles according to the mice's reaction times after cue onset. In line with the above results, we found that faster lick responses were preceded by stronger cholinergic firing (Figure 6E; p = 0.0314, one-way ANOVA). This was also reflected in a significant negative trial-by-trial correlation between BFCN firing rates after the reward-predicting cue and the animals' reaction times (R = −0.45, p = 0.034; Pearson's correlation coefficient, linear regression and one-sided F-test).
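The quartile analysis and the trial-by-trial correlation can be sketched as below. This assumes one reaction time and one cue-window firing rate per trial (hypothetical inputs) and at least four trials; the one-way ANOVA and F-test reported above are not reproduced, and the function names are illustrative.

```python
import statistics
from typing import List, Sequence


def quartile_means(reaction_times: Sequence[float],
                   firing_rates: Sequence[float]) -> List[float]:
    """Mean cue-evoked firing rate in each reaction-time quartile (fastest first)."""
    order = sorted(range(len(reaction_times)), key=lambda i: reaction_times[i])
    n = len(order)
    groups = [order[q * n // 4:(q + 1) * n // 4] for q in range(4)]
    return [statistics.mean(firing_rates[i] for i in g) for g in groups]


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between paired samples x and y."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5
```

A negative `pearson_r(reaction_times, firing_rates)` and decreasing `quartile_means` output would both correspond to the reported pattern of stronger firing before faster licks.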
In sum, these results indicate that cue responses of BFCNs predict reaction times, suggesting that cholinergic outcome prediction coding affects behavioral responses.

DISCUSSION
Cholinergic neurons of the basal forebrain respond to behaviorally salient events. 8-11,22,34,40,41 To better understand the nature of these responses, we investigated whether activity patterns elicited by outcome-predictive stimuli and behavioral feedback are consistent with a prediction error hypothesis. By investigating the responses of BFCN populations using bulk calcium imaging and of individual BFCNs using optogenetic tagging in a probabilistic Pavlovian cued outcome task, we found that BFCNs showed strong activation after reward-predicting stimuli and larger responses to surprising than to expected rewards. These results were consistent with a simple RL model of a stimulus-driven, valence-weighed, unsigned reward prediction error. The model also demonstrated that while BFCNs responded with a firing rate increase to events of both positive and negative valence, they also reflected different behavioral sensitivities to positive and negative expectations. Finally, BFCN responses likely influenced behavioral performance, as mice responded faster after stronger cholinergic activation.
Temporal difference reinforcement learning (TDRL) models have successfully explained the reward prediction errors represented by the dopaminergic system. 35,36,42 The presence of a reward response modulated by expectation and the responsiveness to reward-predicting sensory stimuli suggested that cholinergic signals may also be related to prediction errors; however, the consistently positive-going responses to punishment indicated that this prediction error signal may be unsigned. A model that puts the same positive weight on both aversive and appetitive outcomes tracks the expectation of reinforcement irrespective of its valence; therefore, it would predict identical responses to reward- and punishment-predicting cues if the omission rate is constant. However, BFCNs clearly preferred reward-predicting stimuli, suggesting differential representation of rewards and punishments. Therefore, we implemented an RL model with parameters capturing differential weighing based on valence. We found that this model reliably predicted average BFCN responses to cues, rewards, and punishments. It also reproduced the larger BFCN responses to cues that foreshadowed rewards with high probability, as well as the larger responses to surprising than to expected rewards. The best-fit model indicated non-zero weights for both reward and punishment expectation, suggesting behavioral anticipation of both types of outcomes, but with a significantly larger weight on the expectation of reward, suggesting that the outcome prediction error represented by cholinergic neurons was indeed unequally weighted.
What may be the function of this fast prediction error signal? The cholinergic system has long been known to strongly influence cortical plasticity. 3,43-47 A line of studies has demonstrated that pairing auditory stimuli with cholinergic stimulation reorganizes cortical sensory representations, a phenomenon known as "receptive field plasticity." 23,24 Furthermore, recent studies showed that cholinergic inputs may even endow primary sensory cortices with previously unexpected non-sensory representations. 48,49 In particular, Liu et al. showed that optogenetic activation of cholinergic fibers in the visual cortex entrained neural responses that mimicked behaviorally conditioned reward timing activity. 50 It was also demonstrated that the cholinergic system exerts rapid, finely balanced control over plasticity at millisecond timescales, stressing the importance of timing even for neuromodulatory systems. 51-53 This effect on plasticity might have a fundamental impact on associative learning at the behavioral level, 54 as also suggested by recent advances in the fear learning field. 11,22,55,56 Indeed, we found that the best-fit model parameters were correlated with the difference in the animals' anticipatory lick rates, an index of learning performance. Moreover, cholinergic responses to reward-predicting cues predicted behavioral responses and reaction times, fitting into a more general scheme of basal forebrain control over response speed to motivationally salient stimuli. 21,37,45 Therefore, we propose that rapid acetylcholine-mediated cortical activation, scaled by unsigned outcome prediction error, tunes synaptic plasticity in the service of behavioral learning. This idea is supported by influential theories that associate prediction errors and cholinergic activity with learning and memory. 57-59 Nevertheless, the functions of cholinergic effects probably go beyond learning, and BFCNs may control many aspects of behavior, including arousal or alertness, 21,31,40,45,60-64 attention, 3,54,65-67 and vigilance. 68-70

OPEN ACCESS iScience 26, 105814, January 20, 2023

The activity of cholinergic neurons shares strong similarities with that of dopaminergic neurons in response to reward and reward-predicting cues. 35,71-73 Reward-predicting cues evoke an increase in firing rate, which is stronger for more likely rewards. Reward itself also elicits cholinergic firing, but less so if the reward is more expected. However, cholinergic neurons differ from dopaminergic neurons in their response to punishment. Dopaminergic neurons can respond to aversive stimuli with either increased or decreased firing, 74-76 whereas cholinergic neurons consistently respond to air puffs with a fast, precisely timed response. Therefore, the positive-going response of BFCNs irrespective of valence, which is sensitive to outcome probabilities for cues and rewards, suggests that, in contrast to the signed reward prediction error encoded by dopaminergic neurons, BFCNs represent an unsigned outcome prediction error. Importantly, bursting basal forebrain neurons with similar coding properties have been uncovered in primates, 39 suggesting that at least some of those neurons might be cholinergic.
Cholinergic neurons appeared to respond faster than dopaminergic neurons; however, response timing may depend on seemingly subtle details of the behavioral paradigm. Altogether, BFCNs appear to provide a faster but less specific response to salient stimuli, which is likely broadcast to the large cortical areas innervated by cholinergic fibers. 77,78 In contrast, value-related computations represented in the dopaminergic system may require more processing time, resulting in somewhat delayed, albeit more specific, representations. Nevertheless, direct comparisons of cholinergic and dopaminergic neurons in the same experiment will be necessary to reveal the differential functions of these major neuromodulatory systems.

Limitations of the study
A valence-weighed unsigned prediction error hypothesis predicts a stronger response to unexpected than to expected punishment; the larger the weight of the punishment, the larger the difference. We did not find a significant difference, which could be due to reduced statistical power resulting from the lower weight of punishment, or to a genuine deviation from a full-fledged outcome prediction error.
Additionally, an unsigned prediction error signal predicts a firing rate increase after omitted reward. Note that unsigned prediction error variables only take non-negative values; thus, due to the absolute value operator, all unexpected changes in state value result in increased values. However, alternative models that predict a firing rate decrease after omission are also conceivable. We tested an alternative model that included omission-related activity (see STAR Methods), but this model was statistically indistinguishable from our original model on our data set, suggesting that larger amounts of data are required to resolve this question. Also, given the phasic nature of cholinergic reinforcement responses, which often comprise very few (sometimes only one) but precisely timed action potentials, 10 an omission response, where there is no sensory stimulus to align to, is expected to be very hard to detect in single neurons. Indeed, a recent study demonstrated positive-going omission responses in HDB cholinergic neurons using fiber photometry. 41 Alternatively, the cholinergic system may be sensitive to external sensory stimuli but not to the absence of an expected stimulus, in line with its strong bottom-up anatomical inputs conveying sensory signals, 79 likely gated via local inhibitory neurons that may relay expectation information. 80 A recent study demonstrated topographic variations in cholinergic responses to salient events, 41 which could also contribute to these ambiguities.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Balázs Hangya (hangya.balazs@koki.hu).

Materials availability
This study did not generate new unique reagents.
Data and code availability
• Electrophysiology and fiber photometry data have been deposited to a Dryad repository at https://doi.org/10.5061/dryad.p5hqbzkrv.
• MATLAB codes generated for this study are available at https://github.com/hangyabalazs/cholinergic_Pavlovian_analysis.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Tetrode implantation surgery
Mice were implanted with miniaturized microdrives housing 8 tetrodes and an optic fiber using standard stereotaxic surgery techniques. 10,33,81 Briefly, mice were anesthetized with a mixture of ketamine and xylazine (83 and 17 mg/kg, respectively, dissolved in 0.9% saline). The skin was shaved and disinfected with Betadine, subcutaneous tissues were infused with Lidocaine, eyes were protected with eye ointment (Laboratories Thea), and mice were placed in a stereotaxic frame (Kopf Instruments). The skull was cleaned, and a craniotomy was drilled above the horizontal diagonal band of Broca (HDB; antero-posterior 0.75 mm, lateral 0.60 mm; n = 4) or the medial septum (MS; antero-posterior 0.90 mm, lateral 0.90 mm, 10 degrees lateral angle; n = 1). Virus injection (AAV2/5.EF1a.Dio.hChR2(H134R)-eYFP.WPRE.hGH; HDB, dorso-ventral 5.00 and 4.70 mm, 300 nL at each depth; MS, dorso-ventral 3.95, 4.45 and 5.25 mm, 200 nL at each depth) and drive implantation were performed according to standard techniques. 10,33 Ground and reference electrodes were implanted bilaterally in the parietal cortex. Mice received analgesics (Buprenorphine, 0.1 mg/kg) and local antibiotics (Neomycin), and were allowed 10 days of recovery before behavioral training started.

Behavioral training
Mice were trained on a head-fixed probabilistic auditory Pavlovian conditioning task 28 in a custom-built behavioral setup that allowed millisecond precision of stimulus and reinforcement delivery (described in 7). Mice were water restricted before training and worked for small amounts of water reward (5 µL) during conditioning. Pure tones of one-second duration predicted likely reward/unlikely punishment or unlikely reward/likely punishment based on their pitch (12 kHz tone predicted 80% water reward, 10% air-puff punishment, 10% omission; 4 kHz tone predicted 25% water reward, 65% air-puff punishment, 10% omission in n = 6 mice; opposite cue contingencies were used in n = 5 mice, see Figures S1 and S3; the two cue tones were randomly interleaved with equal (50%) probability; all cue tone intensities were set at 50 dB sound pressure level). The animal was free to lick a waterspout after tone onset, and individual licks were detected by the animal's tongue breaking an infrared photobeam. After an additional 200-400 ms post-stimulus delay, the animal received water reward, air-puff punishment, or omission, pseudorandomized according to the above contingencies. The next trial started after the animal stopped licking for at least 1.5 s. The stimulus was preceded by a 1-4 s foreperiod drawn from a truncated exponential distribution to prevent temporal expectation of stimulus delivery. If the mouse licked in the foreperiod, the trial was restarted. We used the open-source Bpod behavioral control system (Sanworks LLC, US) to operate the task. Behavioral performance did not depend on the identity (frequencies) of the conditioned stimuli (Figure S1).
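To make the trial structure concrete, here is a hypothetical trial generator for the first cohort's contingencies. The 1-s mean of the exponential foreperiod distribution is an assumption, since only the 1-4 s range is stated; independent random draws stand in for the pseudorandomization used in the actual task, and all names are illustrative.

```python
import math
import random
from typing import Dict

# Cue contingencies from the text (n = 6 mice): (P(reward), P(punishment), P(omission))
CONTINGENCIES = {
    "likely_reward": (0.80, 0.10, 0.10),     # 12 kHz tone
    "unlikely_reward": (0.25, 0.65, 0.10),   # 4 kHz tone
}


def truncated_exponential(rng: random.Random, low: float = 1.0,
                          high: float = 4.0, mean: float = 1.0) -> float:
    """Inverse-CDF sample from an exponential shifted to `low` and truncated at `high`."""
    span = 1.0 - math.exp(-(high - low) / mean)
    return low - mean * math.log(1.0 - rng.random() * span)


def make_trial(rng: random.Random) -> Dict[str, object]:
    cue = rng.choice(list(CONTINGENCIES))    # both cues equally likely
    p_r, p_p, _ = CONTINGENCIES[cue]
    u = rng.random()
    if u < p_r:
        outcome = "reward"
    elif u < p_r + p_p:
        outcome = "punishment"
    else:
        outcome = "omission"
    return {
        "foreperiod_s": truncated_exponential(rng),  # no licking allowed here
        "cue": cue,
        "delay_s": rng.uniform(0.2, 0.4),            # post-stimulus delay
        "outcome": outcome,
    }
```

The truncated exponential keeps the hazard of tone onset roughly flat within the allowed range, which is the usual reason for this choice when temporal expectation is to be minimized.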
The aversive quality of air-puffs depends on the exact experimental settings. We applied 200-ms puffs at 15 psi (within the range of parameters used for eyeblink conditioning 82). We previously demonstrated that mice consistently choose water without air-puff over water combined with air-puff, showing that air-puffs are aversive under these circumstances (see Figures 2C and 2D in 10). We also demonstrated that water and air-puff delivery are accompanied by different auditory signals in our setup, making sensory response generalization an unlikely explanation of BFCN responses (see Figure S2A in 28).

Fiber photometry imaging
Bilateral fluorescent calcium imaging was performed using a dual fiber photometry setup (Doric Neuroscience) and visualized during training sessions using Doric Studio software. Two LED light sources (465 nm and 405 nm) were channeled into fluorescence Mini Cubes (iFMC4, Doric Neuroscience). Light was amplitude-modulated by the command voltage of the two-channel LED driver (LEDD_2, Doric Neuroscience; the 465 nm light was modulated at 208 Hz and the 405 nm light at 572 Hz). Light was channeled into 400 µm diameter patch cord fibers, which were connected to the optical fiber implants during training sessions.

The same optical fibers were used to collect the bilateral emitted fluorescence signals, which were detected with 500-550 nm fluorescence detectors integrated in the Mini Cubes. Emitted signals were sampled at 12 kHz, decoded in silico and saved in *.csv format.
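Because the two LEDs are modulated at different frequencies, their contributions can be separated from a single detector signal by lock-in demodulation. The sketch below shows a generic lock-in readout, under the assumption that this is broadly what the in silico decoding does; it is not Doric's implementation. Averaging over a window that contains whole cycles of both reference frequencies stands in for the low-pass filter, and the function name is hypothetical.

```python
import math
from typing import Sequence


def lockin_amplitude(signal: Sequence[float], fs: float, f_ref: float) -> float:
    """Amplitude of the component of `signal` modulated at f_ref (Hz), sampled at fs (Hz)."""
    n = len(signal)
    i_sum = q_sum = 0.0
    for k, s in enumerate(signal):
        phase = 2.0 * math.pi * f_ref * k / fs
        i_sum += s * math.sin(phase)   # in-phase product
        q_sum += s * math.cos(phase)   # quadrature product
    # Each product averages to (A/2)cos(phi) or (A/2)sin(phi); rescale to amplitude A
    return 2.0 * math.hypot(i_sum, q_sum) / n
```

With a 1-s window at 12 kHz, both 208 Hz and 572 Hz complete an integer number of cycles, so the DC offset and the other LED's component average out; a real decoder would slide such a window (or use a proper low-pass filter) to obtain a time-resolved trace.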

Chronic extracellular recording
We used the Open Ephys data acquisition system 83 for spike data collection. A 32-channel Intan headstage (RHD2132) was connected to the Omnetics connector on the custom-built microdrive. Data were digitized at 30 kHz, transferred through digital SPI cables (Intan) to the Open Ephys board, and saved by the Open Ephys software.

Optogenetic tagging
The custom microdrives were equipped with a 50 µm core optic fiber (Thorlabs) that ended in an FC connector (Precision Fiber Products). This was connected to an FC-APC patch cord during recording. For optogenetic tagging, 1 ms laser pulses (473 nm, Sanctity) were delivered at 20 Hz for 2 s, followed by a 3 s pause, repeated 20-30 times. Light-evoked spikes and potential artifacts were monitored online using the OPETH plugin (SCR_018022), 84 and laser power was adjusted as necessary to avoid light-induced photoelectric artifacts and population spikes that could mask individual action potentials. The significance of photoactivation was assessed offline by the SALT test, which compares spike latency distributions after light pulses with a surrogate distribution using the Jensen-Shannon divergence (information radius). 33,85 Neurons with p < 0.01 were considered light-activated, and thus cholinergic. Cholinergic neurons recorded on the same tetrode within 200 µm dorso-ventral distance were compared by waveform correlation and autocorrelogram similarity, 28,86 and similar units were counted towards the sample size only once.
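The stimulation protocol, and the first-spike latencies that feed latency-based tests such as SALT, can be sketched as follows. The 10-ms latency cutoff and the function names are assumptions for illustration; SALT itself, which compares latency distributions with a surrogate distribution, is not reimplemented here.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def pulse_train_onsets(n_trains: int, freq_hz: float = 20.0, train_s: float = 2.0,
                       pause_s: float = 3.0, t0: float = 0.0) -> List[float]:
    """Onset times (s) of 1-ms pulses: 20 Hz for 2 s, then a 3-s pause, repeated."""
    pulses_per_train = int(round(freq_hz * train_s))
    period = train_s + pause_s
    return [t0 + i * period + k / freq_hz
            for i in range(n_trains) for k in range(pulses_per_train)]


def first_spike_latencies(spike_times: Sequence[float], pulse_onsets: Sequence[float],
                          max_latency: float = 0.01) -> List[Optional[float]]:
    """Latency of the first spike within max_latency after each pulse (None if absent)."""
    spikes = sorted(spike_times)
    latencies: List[Optional[float]] = []
    for p in pulse_onsets:
        j = bisect_left(spikes, p)   # first spike at or after the pulse onset
        if j < len(spikes) and spikes[j] - p <= max_latency:
            latencies.append(spikes[j] - p)
        else:
            latencies.append(None)
    return latencies
```

For the 20-30 repetitions described above, `pulse_train_onsets(20)` gives 800 pulse onsets (40 per train).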

Histology
After the last behavioral session, mice were deeply anesthetized with ketamine/xylazine, and we performed an electrolytic lesion to aid electrode localization (5 µA current for ~5 s on 2 leads/tetrode; Supertech, IBP-7c). Mice were perfused transcardially, starting with a 2-minute saline washout, followed by 4% paraformaldehyde (PFA) solution for 20 minutes. After the perfusion, mice were removed from the platform and decapitated. The brain was carefully removed and postfixed overnight in 4% PFA. A block containing the full extent of the HDB was prepared, and 50 µm thick sections were cut using a Leica 2100S vibratome. All attempts were made to section parallel to the canonical coronal plane to aid track reconstruction. All sections that contained electrode tracks were mounted on slides in Aquamount mounting medium. Fluorescent and dark field confocal images of the sections were taken with a Nikon C2 confocal microscope. During track reconstruction, it is important to convert the screw turns logged throughout the experiment (20 µm for each one-eighth of a full turn, allowed by a 160 µm pitch custom precision screw; Easterntec, Shanghai) into brain atlas coordinates with maximal precision. To this end, dark field and bright field images of the brain sections were morphed onto the corresponding atlas planes 87 using Euclidean transformations only. The aligned atlas images were carried over to fluorescent images of the brain sections showing the DiI-labelled electrode tracks (red) and green fluorescent labelling (cholinergic neurons labelled by the AAV2.5-EF1a-Dio-hChR2(H134R)-eYFP.WPRE.hGh virus) in the target area. The entry points, electrode tips and lesion sites were localized with respect to the atlas coordinates, maximizing the combined information of the structural (dark/bright field), DiI track and ChAT-labelling fluoromicrographs.
The recording location of each section was interpolated based on the above coordinates, using the screw turn logs and the measured protruding length of the tetrodes after the experiments (as also described previously 10). If the track spanned multiple sections, special care was taken to precisely reconstruct the part of the track where the recordings took place within the target area. This procedure minimizes the localization errors that may arise from differences between the recorded and the reference brain coordinates and eliminates the effect of tissue distortions caused by the fixation process. Only those recordings that were convincingly localized to the basal forebrain were analyzed in this study.
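The depth bookkeeping above reduces to simple arithmetic: with a 160 μm pitch, one logged eighth of a turn advances the tetrodes by 160/8 = 20 μm. The Python sketch below back-calculates where the tetrode tip sat during each session by subtracting all later advances from the reconstructed final tip position, moving back along a straight track towards the entry point. The straight-track assumption, the log format and the function name are hypothetical, and the sketch ignores the measured protruding length that the procedure above uses as an additional constraint.

```python
import numpy as np

SCREW_PITCH_UM = 160.0                    # one full turn advances the drive by 160 μm
UM_PER_EIGHTH_TURN = SCREW_PITCH_UM / 8   # logged unit: 1/8 turn = 20 μm


def recording_sites(entry_um, tip_um, eighth_turns_after_session):
    """Back-calculate tetrode tip positions (atlas coordinates, μm) for each session.

    entry_um, tip_um: reconstructed track entry point and final tip position.
    eighth_turns_after_session[i]: 1/8 turns advanced after session i
    (hypothetical log format; the last entry is typically 0)."""
    entry = np.asarray(entry_um, dtype=float)
    tip = np.asarray(tip_um, dtype=float)
    direction = (tip - entry) / np.linalg.norm(tip - entry)  # unit vector along the track
    advances = np.asarray(eighth_turns_after_session, dtype=float) * UM_PER_EIGHTH_TURN
    # Distance still travelled after session i = sum of that and all later advances.
    remaining = np.cumsum(advances[::-1])[::-1]
    return tip - remaining[:, None] * direction
```

For example, a vertical track ending at 1000 μm with logs of 4, 4 and 0 eighth turns places the three sessions at 840, 920 and 1000 μm.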

QUANTIFICATION AND STATISTICAL ANALYSIS
Data processing and analysis were carried out in Matlab R2016a (MathWorks, Natick). […] to remove the effect of motion and autofluorescence. Slow decay of the baseline activity was filtered out with a 0.2 Hz high-pass Butterworth digital filter. Finally, the ΔF/F signal was triggered on cue and feedback times, Z-scored by the mean and standard deviation of a baseline window (1 s before cue onset) and averaged across trials.
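The last two photometry steps can be sketched in Python with SciPy, as a stand-in for the Matlab pipeline. The filter order, zero-phase application, per-trial baseline normalization and the analysis window are assumptions, and the motion and autofluorescence correction step is not reproduced because its details are not given here. Following the text, the baseline is always the 1 s before cue onset, even for feedback-aligned averages.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def highpass_dff(dff, fs, cutoff_hz=0.2, order=2):
    """Remove slow baseline decay with a zero-phase Butterworth high-pass filter.

    Filter order and zero-phase application are assumptions."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=fs, output="sos")
    return sosfiltfilt(sos, dff)


def zscored_event_average(signal, fs, cue_samples, event_samples, window_s=(-1.0, 3.0)):
    """Event-triggered average, z-scored per trial by the 1 s window before cue onset.

    cue_samples[i] and event_samples[i] belong to the same trial; for cue-aligned
    averages, pass the cue samples twice. Trials whose windows fall outside the
    recording are skipped."""
    pre, post = int(round(-window_s[0] * fs)), int(round(window_s[1] * fs))
    n_base = int(round(1.0 * fs))
    trials = []
    for cue, ev in zip(cue_samples, event_samples):
        if cue - n_base < 0 or ev - pre < 0 or ev + post > len(signal):
            continue
        baseline = signal[cue - n_base:cue]
        trials.append((signal[ev - pre:ev + post] - baseline.mean()) / baseline.std())
    return np.mean(trials, axis=0)
```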

Data analysis
Tetrode recording channels were digitally referenced to a common average reference and filtered between 700 and 7000 Hz with a zero-phase Butterworth filter, and spikes were detected using a 750 μs censoring period.
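A minimal Python sketch of this preprocessing chain follows. The detection threshold is not specified above, so `robust_threshold` (a multiple of a median-based noise estimate) is a hypothetical choice, as are the filter order and the negative-going, per-channel crossing rule; tetrode detection in practice often triggers on any of the four channels.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def common_average_reference(data):
    """Subtract the across-channel mean at every sample; data is (n_channels, n_samples)."""
    return data - data.mean(axis=0, keepdims=True)


def bandpass(data, fs, low_hz=700.0, high_hz=7000.0, order=3):
    """Zero-phase Butterworth band-pass along the time axis (order is an assumption)."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, data, axis=-1)


def robust_threshold(x, k=4.0):
    """Hypothetical detection threshold: k times a median-based noise estimate."""
    return k * np.median(np.abs(x)) / 0.6745


def detect_spikes(x, fs, threshold, censor_s=750e-6):
    """Sample indices of negative-going threshold crossings on one channel.

    A crossing within the censoring period after the last accepted spike is ignored."""
    below = x < -threshold
    crossings = np.flatnonzero(below[1:] & ~below[:-1]) + 1
    censor = int(np.ceil(censor_s * fs))
    spikes = []
    for idx in crossings:
        if not spikes or idx - spikes[-1] >= censor:
            spikes.append(idx)
    return np.array(spikes, dtype=int)
```

At 10 kHz, the 750 μs censoring period spans 8 samples, so two crossings 3 samples apart yield a single detected spike.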
Spike sorting was carried out in MClust 3.5 software (A. D. Redish). Autocorrelograms were inspected for refractory period violations, and putative units with insufficient refractory periods were not included in the data set. Cluster separation was measured using Isolation Distance and L-ratio, calculated on the basis of two features: the full spike amplitude and the first principal component of the waveform. 88,72 We did not find any systematic differences in our analyses based on anatomical location; thus, we analyzed the 25 neurons as one dataset. First, we calculated event-aligned raster plots and peri-event time histograms (PETHs) for all neurons. To calculate average PETHs, neuronal responses were triggered on cue and feedback times, Z-scored by the mean and standard deviation of a baseline window (1 s before cue onset) and averaged across trials. Response latency and jitter to optogenetic stimulation and behaviorally relevant events were determined based on activation peaks in the peri-event time histograms. 10 Behavioral performance was tested by comparing the anticipatory lick rate after reward- and punishment-predicting stimuli in a 1.2 s time window after stimulus onset. Reaction time was determined as the latency of the first lick after stimulus presentation.
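Isolation Distance and L-ratio have standard definitions based on squared Mahalanobis distances of non-cluster spikes from the cluster center in feature space: Isolation Distance is the distance of the n-th closest non-cluster spike (n being the cluster size), and L-ratio sums the chi-square tail probabilities of all non-cluster spikes, divided by n. The Python sketch below assumes a feature matrix (for example, amplitude and first principal component per tetrode channel) and a boolean cluster mask; MClust's own implementation may differ in details such as feature scaling.

```python
import numpy as np
from scipy.stats import chi2


def cluster_quality(features, in_cluster):
    """Return (isolation_distance, l_ratio) for one cluster.

    features: (n_spikes, n_features) array; in_cluster: boolean mask of cluster spikes."""
    cluster = features[in_cluster]
    noise = features[~in_cluster]
    mu = cluster.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(cluster, rowvar=False))
    diff = noise - mu
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)  # squared Mahalanobis distances
    n_c, df = cluster.shape[0], features.shape[1]
    # L-ratio: summed probability that each noise spike belongs to the cluster, per cluster spike.
    l_ratio = np.sum(chi2.sf(d2, df)) / n_c
    # Isolation Distance: defined only if there are at least as many noise as cluster spikes.
    iso = np.sort(d2)[n_c - 1] if noise.shape[0] >= n_c else np.nan
    return iso, l_ratio
```

Larger Isolation Distance and smaller L-ratio indicate better separated clusters.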
We would like to note that an initial analysis of a part of this data set was presented in a bioRxiv preprint (https://www.biorxiv.org/content/10.1101/2020.02.17.953141v1).

Model fitting
Firing rates of cholinergic neurons were calculated in 500 ms response windows after cue presentation and 200 ms response windows after reinforcement presentation, to include the full firing response based on the observed time course of cholinergic activation (Figure 4). Firing rates were fitted by the following modified temporal difference RL model of cholinergic activity (C).
In this equation, $R - \eta_1 E(R)$ stands for the reward prediction error (RPE). The RPE classically takes the form $R - E(R)$, where $E(R)$ is the expected and $R$ the actual amount of reward at a given time point. This was modified by the $\eta_1$ parameter, allowing for potential differences in sensitivity to reward expectation across animals, sessions and neurons. Similarly, $P - \eta_2 E(P)$ represents the difference between encountered and expected punishment, referred to as the 'punishment prediction error' hereafter. The two terms sum up to a full outcome prediction error, rendered 'unsigned' by the absolute value operator. The scaling factor $S$ accounts for differences in the baseline firing rate of cholinergic neurons. Of note, we found a variable mean firing rate of cholinergic neurons, with an average of 8.22 ± 11.39 (SD) Hz during Pavlovian conditioning. We allowed the model to account for this variability with a scalar factor. This implicitly assumes response magnitudes proportional to 'baseline' firing rate, as in a multiplicative gain model, similar to what was found for dopamine neurons. 26 The temporal discounting factor inherent to TDRL models was omitted from the equation, as it leads to only negligible firing rate differences within the few seconds that a behavioral trial spans. We note that another way of incorporating differential responsiveness to reward and punishment expectations would be to add classical learning rates in the form of $C = S\cdot[\alpha_1 |R - E(R)| + \alpha_2 |P - E(P)|]$. We fitted this alternative model as well; however, it failed to capture the relative ratios of cue and outcome responses of cholinergic neurons and thus resulted in worse fits than the model presented in Figure 5.
iScience 26, 105814, January 20, 2023
The model was evaluated at the times of cue, reward and punishment presentation. At the time of cues, $R = P = 0$; therefore, the model takes the form $C = S\cdot[\eta_1 E(R) + \eta_2 E(P)]$. Since no omission responses were observed in cholinergic recordings (Figure S7), we dropped the negative expectation term of the omitted reinforcer at the time of reinforcement (e.g. the omitted reward response at the time of punishment), leading to $C = S\cdot[R - \eta_1 E(R)]$ at reward ($P = 0$) and $C = S\cdot[P - \eta_2 E(P)]$ at punishment ($R = 0$) delivery. Nevertheless, keeping the omission responses in the model resulted in fits that were statistically indistinguishable (p = 0.86, Wilcoxon signed-rank test of model errors), suggesting that our data were not sufficient to differentiate between RL models with or without omission responses. The E(R) and E(P) expectation terms were set according to the task contingencies (E(R) = 0.8 or 0.25 and E(P) = 0.1 or 0.65 for the likely reward and unlikely reward cues, respectively). As a control model, we ran the same fitting process after swapping these reward and punishment expectations between cues (E(R) = 0.25 or 0.8 and E(P) = 0.65 or 0.1 for the likely reward and unlikely reward cues, respectively). Model parameters were estimated by maximum likelihood, minimizing the fitting error with the built-in Matlab function fminsearch, which implements the Nelder-Mead simplex algorithm. Models were statistically compared by Wilcoxon signed-rank test on the maximum likelihoods. Note that the compared models had equal complexity and number of parameters; therefore, a penalty term for free parameters was not required. Correlation between model parameters and anticipatory lick rate difference was calculated using the built-in robust regression algorithm of Matlab. Confidence intervals were derived using the polypredci.m function (Star Strider, https://www.mathworks.com/matlabcentral/fileexchange/57630-polypredci, MATLAB Central File Exchange, retrieved December 30, 2020).
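The three event-specific forms of the model can be fitted jointly. The Python sketch below uses `scipy.optimize.minimize` with `method="Nelder-Mead"` as a stand-in for Matlab's fminsearch. It assumes a Gaussian noise model with unit variance (so the negative log-likelihood reduces to half the sum of squared residuals), unit reinforcer magnitudes (R = P = 1), and six observed mean rates per neuron (cue, reward and punishment responses after each of the two cues). These assumptions and all function names are illustrative, not necessarily the authors' choices.

```python
import numpy as np
from scipy.optimize import minimize

# Task contingencies from the text: (E(R), E(P)) for each cue type.
CONTINGENCIES = {"likely_reward": (0.8, 0.1), "unlikely_reward": (0.25, 0.65)}
# Control model: reward and punishment expectations swapped between cues.
SWAPPED = {"likely_reward": (0.25, 0.65), "unlikely_reward": (0.8, 0.1)}
CUES = ("likely_reward", "unlikely_reward")


def predict(params, contingencies=CONTINGENCIES, R=1.0, P=1.0):
    """Predicted responses [cue, reward, punishment] for each cue type.

    Unit reinforcer magnitudes (R = P = 1) are an assumption."""
    S, eta1, eta2 = params
    out = []
    for cue in CUES:
        ER, EP = contingencies[cue]
        out += [S * (eta1 * ER + eta2 * EP),  # cue: R = P = 0
                S * (R - eta1 * ER),          # reward delivery (P = 0)
                S * (P - eta2 * EP)]          # punishment delivery (R = 0)
    return np.array(out)


def neg_log_likelihood(params, observed, contingencies=CONTINGENCIES):
    """Gaussian NLL with unit variance (assumed noise model), up to a constant."""
    residuals = np.asarray(observed) - predict(params, contingencies)
    return 0.5 * np.sum(residuals ** 2)


def fit(observed, contingencies=CONTINGENCIES, x0=(1.0, 0.5, 0.5)):
    """Nelder-Mead simplex minimization, analogous to Matlab's fminsearch."""
    res = minimize(neg_log_likelihood, x0, args=(observed, contingencies),
                   method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-12,
                            "maxiter": 20000, "maxfev": 20000})
    return res.x, res.fun
```

With noise-free synthetic rates such as `predict((5.0, 0.8, 0.4))`, the fit should recover the generating parameters, whereas fitting the same rates with `SWAPPED` leaves a larger negative log-likelihood. Collecting both values per neuron and passing the paired values to `scipy.stats.wilcoxon` would mirror the model comparison described above.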
Spike trains were simulated as Poisson processes matched in firing rate to each recorded cholinergic neuron (n = 25). Cholinergic responses to cue, reward and punishment were simulated as additional spikes whose times were drawn from a Gaussian distribution at a fixed latency after the events. The number of 'evoked spikes' was based on the best-fit RL model corresponding to each neuron. Peri-event time histograms were generated from simulated spike trains in the same way as for real data.
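The simulation can be sketched in Python as a homogeneous Poisson background plus Gaussian-timed evoked spikes, followed by a simple PETH. The latency and jitter defaults are hypothetical placeholders, not the measured values. How the best-fit model output is converted to an integer number of evoked spikes per event (for example, by rounding the predicted rate multiplied by the response window) is likewise an assumption.

```python
import numpy as np


def simulate_spike_train(rng, rate_hz, duration_s, event_times, n_evoked,
                         latency_s=0.02, jitter_s=0.005):
    """Homogeneous Poisson background plus Gaussian-timed evoked spikes.

    n_evoked[i] extra spikes are added after event_times[i], with times drawn
    from N(event + latency_s, jitter_s**2). Latency and jitter are placeholders."""
    n_background = rng.poisson(rate_hz * duration_s)
    background = rng.uniform(0.0, duration_s, n_background)
    evoked = [rng.normal(t + latency_s, jitter_s, k) for t, k in zip(event_times, n_evoked)]
    spikes = np.concatenate([background, *evoked])
    spikes = spikes[(spikes >= 0.0) & (spikes < duration_s)]  # keep spikes inside the recording
    return np.sort(spikes)


def peth(spikes, event_times, window_s=(-1.0, 1.0), bin_s=0.01):
    """Trial-averaged firing rate (Hz) in bins around the events."""
    edges = np.arange(window_s[0], window_s[1] + bin_s / 2, bin_s)
    counts = sum(np.histogram(spikes - t, bins=edges)[0] for t in event_times)
    return counts / (len(event_times) * bin_s), edges
```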

Statistics
We estimated the sample size before conducting the study based on previous publications, mostly Hangya et al. (2015), 10 as reported in the Results. Firing rates and other variables were compared across conditions using non-parametric tests, as normality of the underlying distributions could not be determined. The two-sided Wilcoxon signed-rank test was applied for paired samples, and the two-sided Mann-Whitney U-test for non-paired samples. Correlations were estimated by Pearson's correlation coefficient, and their significance was judged using a standard linear regression approach (one-sided F-test, in accordance with the asymmetric null hypothesis of linear regression). The relationship between BFCN firing rate and reaction time quartiles was also assessed by one-way ANOVA. Model fits were compared by negative log-likelihood. Since the compared models had equal numbers of parameters, this is mathematically equivalent to model selection approaches using information criteria (e.g. the Akaike and Bayesian Information Criterion). Peri-event time histograms show mean ± SE. Box-whisker plots show median, interquartile range and non-outlier range, with all data points overlaid.
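The equivalence noted above follows directly from the definitions of the two criteria, where $k$ is the number of free parameters, $n$ the number of data points and NLL the minimized negative log-likelihood:

```latex
\begin{aligned}
\mathrm{AIC} &= 2k + 2\,\mathrm{NLL}, \qquad \mathrm{BIC} = k \ln n + 2\,\mathrm{NLL},\\
k_1 = k_2,\ n_1 = n_2 \;&\Rightarrow\; \mathrm{AIC}_1 - \mathrm{AIC}_2 = \mathrm{BIC}_1 - \mathrm{BIC}_2 = 2\,(\mathrm{NLL}_1 - \mathrm{NLL}_2).
\end{aligned}
```

Because the parameter penalty cancels, ranking the models by either criterion gives the same order as ranking them by negative log-likelihood.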